Using HPV-meta for human papillomavirus RNA quality detection

In the era of cervical cancer elimination, accurate and validated pipelines to detect human papillomavirus are essential to elucidate and understand HPV association with human cancers. We aimed to provide an open-source pipeline, “HPV-meta”, to detect HPV transcripts in RNA sequencing data, including several steps to warn operators for possible viral contamination. The “HPV-meta” pipeline automatically performs several steps, starting with quality trimming, human genome filtering, HPV detection (blastx), cut-off settlement (10 reads and 690 bp coverage to make an HPV call) and finishing with fasta sequence generation for HPV positive samples. Fasta sequences can then be aligned to assess sequence diversity among HPV positive samples. All RNA sequencing files (n = 10,908) present in the cancer genome atlas (TCGA) were analyzed. “HPV-meta” identified 25 different HPV types being present in 488/10,904 specimens. Validation of results showed 99.98% agreement (10,902/10,904). Multiple alignment from fasta files warned about high sequence identity between several HPV 18 and 38 positive samples, whose contamination had previously been reported. The “HPV-meta” pipeline is a robust and validated pipeline that detects HPV in RNA sequencing data. Obtaining the fasta files enables contamination investigation, a non very rare occurrence in next generation sequencing.

analysis 18 . The International HPV Reference Center already reported existence of 114 possible chimeras and 13 taxonomy/naming errors when performing a GenBank screening analysis of HPV deposited sequences in 2021 19 and several authors have reported presence of HPV contamination among different studies (e.g. HPV 38 in endometrial samples, HeLa nucleic acid contamination leading to misidentification of HPV 18) 20,21 . Wellvalidated HPV pipelines are needed to prevent database contamination and false calling to continue with quality and order in HPV research.
The present study aimed to develop a robust open-source HPV pipeline, "HPV-meta" (https:// github. com/ hpvce nter/ HPV-meta) to detect HPV transcripts in RNA sequencing data which could further provide information to operators and warn for possible HPV contamination. To achieve this aim, a well-known and validated public database, the cancer genome atlas (TCGA, https:// www. cancer. gov/ tcga) was used for pipeline development and validation purposes. TCGA is a collaboration between the National Cancer Institute and the National Human Genome Research Institute and contains molecularly characterization of over 20,000 primary cancers and matched normal specimens spanning 33 cancer types (https:// www. cancer. gov/ tcga, accessed on 2021-08-10).

Methods
For this study, we downloaded all RNA sequencing (RNA-seq) datasets from the TCGA portal belonging to primary tumors (n = 9774), metastases (n = 394), recurrent tumors (n = 47) and solid normal tissues (n = 693) to develop and validate "HPV-meta", a pipeline aiming to detect HPV transcript in RNA sequencing data. All research was performed in accordance with relevant guidelines/regulations. By using the GDC transfer tool, we downloaded 10,908 bam files comprising around 90 terabyte of sequencing data. For each of the bam files, we downloaded the corresponding metadata, including the specimens and clinical data (sample type -primary tumor, control, metastasis-, ICD-10, tissue or organ of origin and primary diagnosis).
In order to process the large amount of sequencing data, we developed an automated pipeline called "HPVmeta" for HPV detection in RNA sequencing specimens. "HPV-meta'' performs several steps including quality trimming, human genome filtering, HPV detection, cut-off settlement and fasta sequence generation (Fig. 1). The fasta sequence is intended for users to inspect possible contamination. Be aware that the whole complete sequence may not be obtained, as it is RNA data (non-coding regions will not appear if DNA is not present).
The pipeline and all the tools used for the data processing and analysis are openly accessible in the github repository (https:// github. com/ hpvce nter/ HPV-meta). For this study, the code was integrated onto the Spark and Hadoop environment but it can definitely be used in other environments as well.
A detailed description of the different steps performed by "HPV-meta" is provided as follows.
Filtering human genome sequences. This step is recommended when starting with sequencing reads that were already aligned to a human reference genome. If the user starts with raw data (fastq files), this step must be omitted, and files should be subjected to quality filtering directly. A close look at the original bam files included in the TCGA database (n = 10,908) revealed that all bam files were mapped to databases containing a human reference genome as well as other viral sequences. HPV was included in some of those databases. Therefore, the "HPV-meta" pipeline starts by using pysam v. 0.16.1 (htslib/ bcftools v. 1.10.2 and samtools v. 1.10) with a bed file containing human chromosomes and genomic regions to filter out the mapped human reads but not the viral sequences. Non-human reads remain in the newly created bam file (file_name_NH.bam), which is then sorted and converted to a fastq file (file_name_NH.fq.gz) using pysam.
Quality filtering and human re-filtering. To obtain high quality non-human reads, "HPV-meta" uses Trimmomatic v. 0.39 (https:// github. com/ usade llab/ Trimm omatic) 22 . Parameters for Trimmomatic are as follows: -SE -phred 33 -leading 3 -trailing 3-slidingWindow:4:15 and-minlen: 18. In the case of the TCGA, none of the bam files belonged to samples subjected to library preparation kits that needed extra quality trimming steps. However, as library kits are expanding, the "HPV-meta" pipeline includes the step of "extra trimming", using cutadapt version 3.3. As an example, when performing library preparations using the SMARTer® Stranded Total RNA-Seq Kit v2-Pico Input Mammalian library preparation guide (Takara, US), it is required to trim/remove the first 3 nucleotides from every R2 read for the RNA sequencing fastq-files 23 . The operator can choose to use or to omit this step.
High quality reads are afterwards re-screened against the human reference genome version GRCh38 using Nextgenmap v. 0.5.5 24 . The program is run under default settings, except for -i 0.95, -R 0.75 and-silent-clip. Reads that map to the human reference genome (if they align with more than 95% identity over 75% of their length to the human genome) are filtered out, and the corresponding bam files with no human sequences are then sorted and converted to fastq files again (File_ID_non-human_quality.fq). We highly recommend filtering human sequences twice when users start with bam/sam files coming from public databases as the human reference genomes used previously may not be the latest update.
HPV detection and cut-offs. High quality non-human reads are queried against all HPV protein sequences included in the PaVE database (Papillomavirus Episteme, https:// pave. niaid. nih. gov/, accessed on 2021-08-17), including all protein sequences from HPV reference and non-reference genomes), using the open source software Diamond v. 2.0.7 25 using blastx with default parameters and -top 1.
Positivity of HPV is only considered if at least 10 reads are present for one HPV type and 690 bp of its genome are covered. The cut-off is based on previous validated cut-offs used for HPV detection on total nucleic acid extracted specimens (cut-off 750 bp, ~ 10% of total genome) 7,8,18 . As the HPV genome comprises around 8000 bp and approximately 1000 of them are non-coding, the "HPV-meta" pipeline has reduced the number of bp to be covered (from 750 to 690 bp), in order to maintain a similar % of minimum coverage for positivity calling. The www.nature.com/scientificreports/ "HPV-meta" pipeline calculates the coverage for each sequencing file and reports the total coverage as well as the coverage for each HPV gene. HPV detection will only be reported for one HPV type per specimen. In the case of multiple HPV infection within a sample, the HPV type with the highest number of unique reads will be reported. In the case that 2 HPV types present the same number of unique reads, the type which presents a higher coverage is selected.
HPV fasta files. The "HPV-meta" pipeline generates a consensus fasta file for each HPV type classified as positive within a specimen. High quality non-human reads fastq files (File_ID_non-human_quality.fq) are queried against a database of known HPV sequences including all HPV genomes officially established by the International HPV Reference Center (n = 222 officially established HPV types, https:// www. hpvce nter. se, accessed on 2021-08-17), together with complete genome sequences from HPV types that are not officially established yet (n = 222, https:// pave. niaid. nih. gov, accessed on 2021-08-17), using NextGenMap 24 under the same settings as described above. For each specimen, reads mapping to the HPV genotype previously detected with diamond are filtered for further analysis.
Resulting BAM files are then processed through a quality control and left aligned using the The Genome Analysis Toolkit (GATK) v. 4.2.3.0, LeftAlignIndels module. The HPV genomes are then genotyped by GATK HaplotypeCaller 26 SNP and indel calls are made and hard filtered, following GATK Best Practices. All variant  22 extra trimming (needed for specific library preparation kits, e.g: removing 3 bp of R2 from libraries prepared with the Smarter stranded total RNA-seq kit from Takara, USA) performed with Cutadapt v. 3.3) 23 re-mapping to human reference genome (double human cleaning using Nextgenmap v. 0.5.5) 24 filtering out of human reads, mapping non-human reads to an HPV protein database (Diamond v. 2.0.7) 25 coverage calculation and, if HPV positivity is present, a fasta file is generated by mapping the reads to HPV genome references and subjecting them to variant calling using GATK v. 4 Analysis of sequence identity. It is well reported in literature that for each HPV type, extensive sequence diversity is observed when analyzing HPV positive samples. Around 84.5% of women infected with HPV isolates do carry unique HPV sequences within their cervical samples 27,28 . With the generated fasta sequences, operators can for example compare the sequence identities obtained for one HPV type in one sample, with the rest of sequences presenting the same HPV type but detected in other specimens. The present study includes an analysis of sequence identity for all HPVs that were present in at least 30 samples (HPV 16, 18 and 38).
We would like to note that 100% identity is not a final proof of contamination, as different women may be infected with the same HPV type and identical sequence. However, isolates showing 100% nucleotide identity should be considered as "flagged" for the operator in order to have a closer look if needed.

Results
A total of 10,908 bam files were downloaded from the TCGA database, and subjected to HPV detection using the "HPV-meta" pipeline. Four sequencing files (all corresponding to primary tumors) could not be processed due to failure/corruption in the original bam files and therefore the analysis was made in 10,904 sequencing files (9770 primary tumors, 394 metastases, 47 recurrent tumors and 693 solid normal tissues).

HPV detection validation.
Results from HPV type detection obtained with the "HPV-meta" pipeline were compared to the HPV mapping hits present in the original bam files from the whole TCGA collection for validation purposes (as the database used in the alignment already contained sequences from most HPV types).
All HPV types detected by "HPV-meta" were confirmed within the specimens, except for 2 samples where the genotypes detected with both approaches did not agree. One specimen (cervical primary tumor) showed HPV 52 in the original bam file while the "HPV-meta" revealed HPV 58. Further analysis on original bam files revealed presence of both genotypes, with HPV 52 showing 23 reads and HPV 58 19 reads. The "HPV-meta" pipeline revealed presence of HPV 52 (16 unique reads) and HPV 58 (16 unique reads) and HPV 58 was selected as the dominant HPV type due to higher coverage (690 nt). The other specimen (ovarian primary tumor) showed HPV-mCG3 in the original bam file (15 reads) while the "HPV-meta" revealed presence of HPV-mSK041 (21 reads, coverage 1119 nt). All specimens classified as negative by the "HPV-meta" pipeline have also been reported negative in the TCGA studies. HPV coverage. The "HPV-meta" pipeline provided HPV coverage for all HPV positive specimens, revealing that HPV16 showed the highest median coverage when detected in gum, upper gum, oropharynx, base of tongue and tonsil specimens (6870, 6840, 6625, 6231 and 6148 nt covered, respectively, which corresponds to 86.90-77.76% of total genome coverage) followed by HPV 6 detected in anterior wall of bladder (6162 nt covered, 77.06% of total genome coverage), when compared to other HPV types and locations (Supplementary Table S1).
Sequencing reads from a total of 22 organs/tissues of origin where HPV positivity was detected, showed less than 2000 nt coverage (about < 25% total genome coverage) for all HPV types detected at these locations: brain and cerebrum, liver, colon (including ascending, transverse, descending, sigmoid and unspecified), corpus uteri, mouth and floor of mouth, posterior wall of oropharynx, cecum, cardia, thyroid gland, skin, pleura and lower lobe of lung, head of pancreas, prostate gland, ovary and connective, subcutaneous and other soft tissues of lower limb and hip (Supplementary Table S1).
The Supplementary Table S1 (available online) provides the number of total HPV reads and the HPV median coverage (both total and gene coverage), for each HPV type detected in a different tissue/organ location from all RNA sequencing data from TCGA.

Discussion
The present study described "HPV-meta", the first open-source pipeline aiming to specifically detect HPV transcripts in RNA sequencing data and provide further HPV type sequences to inspect for possible viral contamination among samples. We report the description and validation of the pipeline using the TCGA database including almost 11,000 RNA sequencing files.
Strengths of the study include: (a) Using TCGA database, a well validated (as well as exploited) database, that provided no less than 10,904 RNA sequencing files from 137 different organs, spanning 33 cancer types. Open science, data sharing and collaboration is essential for understanding any research question; (b) screening against an updated and complete HPV database including all officially established HPV types as well as the non-official types; and (c) the possibility of generating fasta files (obtain the HPV sequence) that can be used to compare Table.1. HPV types detected in RNA sequencing data from TCGA using "HPV-meta". HPV types detected in TCGA RNA sequencing files. Number in brackets corresponds to number of specimens for each organ/tissue of origin. Cervix uteri (7)  7 HPV39 Cervix uteri (5)  www.nature.com/scientificreports/ sequence identity among samples presenting the same HPV type within a study. This is due to the singularity of the HPV genome, mostly covered by transcribing ORFs (> 85% of total HPV genome is covered by coding genes). Weakness of the study may include the single HPV result obtained from the pipeline. The pipeline was designed to report only one HPV type in case of multiple HPV infection. However, the information about the different HPV types detected, the number of unique reads and the coverage is saved for every positive HPV. The operator may choose to change the code in order to report all HPV types detected. Furthermore, "HPV-meta" targets only HPV (no other viruses and/or microorganisms), however, changing the database and including the desired microorganisms´ sequences may enable the pipeline to potentially detect other microorganisms.
"HPV-meta" identified HPV in 488/10,904 specimens and HPV was present in 46/137 organs. For some cancer forms, like cervical cancers, it is already known that they will regularly contain HPV transcripts from the E6 and E7 viral oncogenes and these cancer forms did serve as a positive control in this project. Organs with HPV-associated cancers showed the highest number of specimens positive for HPV as expected (cervix uteri (n = 291/309), endometrium (n = 45/561) and tonsils (n = 34/41)), but HPV was also detected in non-HPV associated tumors (e.g. 7/683 cerebrum/brain tumors). To validate the results, we examined all original bam files from the TCGA as they had the HPV mapping information and compared those results with the results from "HPV-meta". There was almost a 100% agreement (99.98%) with only 2/488 HPV positive samples not reporting the same genotype. The disagreement of one of them (cervical tumor) could be explained due to a multiple www.nature.com/scientificreports/ infection of 2 genotypes presenting the same number of reads in the same sample. The second disagreement occurred in one ovarian tumor where 2 different non-officially HPVs were reported. While the type detected in the original bam files corresponded to HPV-mCG3 (non-official HPV sequence submitted in 2012), the "HPVmeta" revealed presence of HPV-mSK041 (non-official HPV sequence submitted in 2018, and therefore not present in the original database). It can be thus confirmed that "HPV-meta" could detect all HPVs with no obstacle. Possible sequencing contamination is a weakness from NGS and operators must be highly experienced when performing sequencing protocols. Possible contamination within samples has already been reported for TCGA specimens (HPV 38 in endometrial samples and HeLa nucleic acid contamination leading to misidentification of HPV 18 in non-cervical cancers) 20,21 . In agreement with these authors, HPV 38 was detected in 41 samples, and all of them belonged to the same study (endometrium), which should cause at least suspicion about possible contamination. Fasta generated alignments and pairwise comparisons showed total sequence identity among most of the sequences (> 80%) for HPV type 38, raising a flag for possible contamination. The same scenario occurred in some TCGA studies for HPV 18. Use of positive controls may be necesary when performing NGS sequencing, but negative controls are a must.
In conclusion, the "HPV-meta" pipeline has shown to be an open-source pipeline which accurately detects HPV in RNA sequencing data with validated cut-offs. Obtaining the fasta files adds the possibility of contamination investigation, a non-very rare occurrence in next generation sequencing which should always be considered.

Data availability
All files used for this study are available at the public database the cancer genome atlas (TCGA). "HPV-meta" is publicly available at https:// github. com/ hpvce nter/ HPV-meta.